Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units
Audio tagging aims to detect the types of sound events occurring in an audio recording. To tag polyphonic audio recordings, we propose to use a Connectionist Temporal Classification (CTC) loss function on top of a Convolutional Recurrent Neural Network (CRNN) with learnable Gated Linear Units (GLU-CTC), based on a new type of audio label data: Sequentially Labelled Data (SLD). In GLU-CTC, the CTC objective function maps the frame-level label probabilities to clip-level label probabilities. To assess the mapping ability of GLU-CTC for sound events, we train a CRNN with GLU based on Global Max Pooling (GLU-GMP) and a CRNN with GLU based on Global Average Pooling (GLU-GAP). We also compare the proposed GLU-CTC system with a baseline system, a CRNN trained with the CTC loss function but without GLU. The experiments show that GLU-CTC achieves an Area Under Curve (AUC) score of 0.882 in audio tagging, outperforming GLU-GMP (0.803), GLU-GAP (0.766) and the baseline system (0.837). This means that, for the same CRNN model with GLU, CTC mapping performs better than GMP and GAP mapping; and, with both systems based on CTC mapping, the CRNN with GLU outperforms the CRNN without GLU.
Comment: DCASE2018 Workshop. arXiv admin note: text overlap with arXiv:1808.0193
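To make the GLU gating and the CTC frame-to-clip mapping concrete, here is a minimal sketch in PyTorch; the layer sizes, the `GLUConvBlock` name, and the random tensors standing in for a CRNN's frame-level outputs are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a GLU convolutional block and CTC-style mapping of
# frame-level probabilities to a clip-level label sequence (the SLD order
# of events). All sizes below are placeholders.
import torch
import torch.nn as nn

class GLUConvBlock(nn.Module):
    """Gated Linear Unit: (W*x) element-wise gated by sigmoid(V*x)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.linear = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.linear(x) * torch.sigmoid(self.gate(x))

# Frame-level log-probabilities from a CRNN feed the CTC loss, which maps
# them to clip-level event sequences without any frame-level alignment.
num_classes, blank = 10, 0
ctc = nn.CTCLoss(blank=blank)
log_probs = torch.randn(240, 4, num_classes + 1).log_softmax(-1)  # (T, batch, C+1)
targets = torch.randint(1, num_classes + 1, (4, 5))               # event sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 240, dtype=torch.long),
           target_lengths=torch.full((4,), 5, dtype=torch.long))
```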
Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering
Sound event detection (SED) methods typically rely on either strongly
labelled data or weakly labelled data. As an alternative, sequentially labelled
data (SLD) was proposed. In SLD, the events and the order of events in audio
clips are known, without knowing the occurrence time of events. This paper
proposes a connectionist temporal classification (CTC) based SED system that
uses SLD instead of strongly labelled data, with a novel unsupervised
clustering stage. Experiments on 41 classes of sound events show that the
proposed two-stage method trained on SLD achieves performance comparable to the
previous state-of-the-art SED system trained on strongly labelled data, and is
far better than another state-of-the-art SED system trained on weakly labelled
data, which indicates the effectiveness of the proposed two-stage method
trained on SLD without any onset/offset times of sound events.
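As a rough illustration of what an unsupervised clustering stage over frame-level activations could look like, here is a sketch assuming NumPy and scikit-learn; clustering per-frame probabilities into active/inactive groups with KMeans is an assumed stand-in, not necessarily the paper's exact algorithm.

```python
# A hedged sketch: cluster per-frame event probabilities into two groups
# (active vs. inactive) and read off contiguous active runs as segments,
# recovering onset/offset estimates that SLD itself does not provide.
import numpy as np
from sklearn.cluster import KMeans

def frames_to_segments(frame_probs, hop_seconds=0.02):
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(frame_probs.reshape(-1, 1))
    # The cluster with the higher mean probability is the "active" one.
    active = labels == np.argmax([frame_probs[labels == k].mean() for k in (0, 1)])
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:
        segments.append((start * hop_seconds, len(active) * hop_seconds))
    return segments

print(frames_to_segments(np.array([0.1, 0.2, 0.9, 0.95, 0.8, 0.1, 0.05])))
```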
Audio Event-Relational Graph Representation Learning for Acoustic Scene Classification
Most deep learning-based acoustic scene classification (ASC) approaches
identify scenes based on acoustic features converted from audio clips
containing mixed information entangled by polyphonic audio events (AEs).
However, these approaches have difficulties in explaining what cues they use to
identify scenes. This paper conducts the first study on disclosing the
relationship between real-life acoustic scenes and semantic embeddings from the
most relevant AEs. Specifically, we propose an event-relational graph
representation learning (ERGL) framework for ASC that classifies scenes while directly revealing which cues are used in the classification. In the event-relational graph, the embedding of each event is treated as a node, while relationship cues derived from each pair of nodes are
described by multi-dimensional edge features. Experiments on a real-life ASC
dataset show that the proposed ERGL achieves competitive performance on ASC by
learning embeddings of only a limited number of AEs. The results show the
feasibility of recognizing diverse acoustic scenes based on the audio
event-relational graph. Visualizations of graph representations learned by ERGL
are available here (https://github.com/Yuanbo2020/ERGL).
Comment: IEEE Signal Processing Letters, doi: 10.1109/LSP.2023.331923
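The node/edge construction can be illustrated with a small sketch in PyTorch; the `EdgeConditionedLayer` design, the dimensions, and the mean aggregation are assumptions for illustration, not the ERGL architecture itself.

```python
# A minimal sketch of message passing over an event-relational graph:
# node features are audio-event embeddings, and each directed node pair
# carries a multi-dimensional edge feature derived from the pair.
import torch
import torch.nn as nn

class EdgeConditionedLayer(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim, edge_dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(node_dim + edge_dim, node_dim), nn.ReLU())

    def forward(self, nodes):                       # nodes: (N, node_dim)
        n = nodes.size(0)
        src = nodes.unsqueeze(1).expand(n, n, -1)   # sender of each pair
        dst = nodes.unsqueeze(0).expand(n, n, -1)   # receiver of each pair
        edges = self.edge_mlp(torch.cat([src, dst], dim=-1))  # (N, N, edge_dim)
        messages = edges.mean(dim=0)                # aggregate incoming edges
        return self.node_mlp(torch.cat([nodes, messages], dim=-1))

event_embeddings = torch.randn(25, 64)              # e.g. 25 audio-event nodes
layer = EdgeConditionedLayer(node_dim=64, edge_dim=16)
scene_logits = nn.Linear(64, 10)(layer(event_embeddings).mean(dim=0))
```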
Rule-embedded network for audio-visual voice activity detection in live musical video streams
Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) primarily rely on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network that fuses the audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out the information of non-target sounds. Experiments show that: 1) with the help of cross-modal fusion by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; 2) the performance of the bi-modal model far outperforms that of audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
Comment: Submitted to ICASSP 202
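A hedged sketch of the masking idea, in PyTorch: the encoders are elided, and the sigmoid gate computed from visual features is an assumed realization of the rule, not the paper's exact network.

```python
# Visual representations act as a multiplicative mask over audio features,
# suppressing frames dominated by non-target sound. Feature sizes are
# placeholders.
import torch
import torch.nn as nn

class MaskedAVFusion(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, hidden=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_gate = nn.Linear(visual_dim, hidden)
        self.classifier = nn.Linear(hidden, 1)   # frame-level voice activity

    def forward(self, audio_feats, visual_feats):
        # visual cue (is the target performer vocalizing?) gates the audio
        gate = torch.sigmoid(self.visual_gate(visual_feats))  # values in (0, 1)
        fused = self.audio_proj(audio_feats) * gate
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)

model = MaskedAVFusion()
audio = torch.randn(2, 100, 128)    # (batch, frames, audio feature dim)
visual = torch.randn(2, 100, 512)   # per-frame visual embeddings
voice_prob = model(audio, visual)   # (batch, frames), probabilities in [0, 1]
```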
AI-based soundscape analysis: Jointly identifying sound sources and predicting annoyance
Soundscape studies typically attempt to capture the perception and
understanding of sonic environments by surveying users. However, for long-term
monitoring or assessing interventions, sound-signal-based approaches are
required. To this end, most previous research focused on psycho-acoustic
quantities or automatic sound recognition. Few attempts were made to include
appraisal (e.g., in circumplex frameworks). This paper proposes an artificial
intelligence (AI)-based dual-branch convolutional neural network with
cross-attention-based fusion (DCNN-CaF) to analyze automatic soundscape
characterization, including sound recognition and appraisal. Using the DeLTA
dataset containing human-annotated sound source labels and perceived annoyance,
the DCNN-CaF is proposed to perform sound source classification (SSC) and
human-perceived annoyance rating prediction (ARP). Experimental findings
indicate that (1) the proposed DCNN-CaF using loudness and Mel features
outperforms the DCNN-CaF using only one of them. (2) The proposed DCNN-CaF with
cross-attention fusion outperforms other typical AI-based models and
soundscape-related traditional machine learning methods on the SSC and ARP
tasks. (3) Correlation analysis reveals that the relationship between sound
sources and annoyance is similar for humans and the proposed AI-based DCNN-CaF
model. (4) Generalization tests show that the proposed model's ARP in the
presence of model-unknown sound sources is consistent with expert expectations
and can explain previous findings from the literature on soundscape augmentation.
Comment: The Journal of the Acoustical Society of America, 154 (5), 314
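The cross-attention fusion of the two branches can be sketched as follows, assuming PyTorch; treating loudness-branch features as queries over Mel-branch keys/values, and the head sizes (e.g. `n_sources=24`), are illustrative assumptions rather than the exact DCNN-CaF.

```python
# A minimal sketch of dual-branch features fused by cross-attention, with
# one multi-label head for sound source classification (SSC) and one
# regression head for annoyance rating prediction (ARP).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=128, heads=4, n_sources=24):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ssc_head = nn.Linear(dim, n_sources)  # multi-label sound sources
        self.arp_head = nn.Linear(dim, 1)          # scalar annoyance rating

    def forward(self, loudness_feats, mel_feats):
        # loudness-branch queries attend over Mel-branch keys/values
        fused, _ = self.attn(loudness_feats, mel_feats, mel_feats)
        clip = fused.mean(dim=1)                   # pool over time
        return torch.sigmoid(self.ssc_head(clip)), self.arp_head(clip)

loudness = torch.randn(8, 100, 128)  # (batch, time, dim), loudness branch
mel = torch.randn(8, 100, 128)       # (batch, time, dim), Mel branch
sources, annoyance = CrossAttentionFusion()(loudness, mel)
```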
Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audio-visual information. To integrate information from the audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, and the representations are fused by attention based on semantic similarity, shaping the acoustic representations through the probability of anchor vocalization. Experiments show that the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating that cross-modal information fusion based on semantic correlation is sensible and effective.
Comment: Accepted by INTERSPEECH 202
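A minimal sketch of attention weights derived from audio-visual semantic similarity, assuming PyTorch; re-weighting per-frame audio embeddings by a sigmoid of their cosine similarity to visual embeddings is one plausible reading of the described fusion, not the paper's exact model.

```python
# Per-frame cosine similarity between audio and visual embeddings (in a
# shared space) serves as a proxy for the probability that the anchor is
# vocalizing, and re-weights the acoustic representation accordingly.
import torch
import torch.nn.functional as F

def similarity_fusion(audio_emb, visual_emb):
    """audio_emb, visual_emb: (batch, frames, dim) in a shared space."""
    sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)  # (batch, frames)
    weights = torch.sigmoid(sim).unsqueeze(-1)                # (batch, frames, 1)
    return audio_emb * weights   # acoustic features shaped by visual evidence

audio_emb = torch.randn(2, 50, 64)
visual_emb = torch.randn(2, 50, 64)
fused = similarity_fusion(audio_emb, visual_emb)   # (2, 50, 64)
```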
Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning
Sound events in daily life carry rich information about the objective world. The composition of these sounds affects the mood of people in a soundscape. Most previous approaches only focus on classifying and detecting audio events and scenes, but may ignore their perceptual quality, which may impact humans' listening mood in the environment, e.g., annoyance. To this end, this paper proposes a novel hierarchical graph representation learning (HGRL) approach which links objective audio events (AE) with the subjective annoyance ratings (AR) of the soundscape perceived by humans. The hierarchical graph consists of fine-grained event (fAE) embeddings with single-class event semantics, coarse-grained event (cAE) embeddings with multi-class event semantics, and AR embeddings. Experiments show the proposed HGRL successfully integrates AE with AR for the audio event classification (AEC) and annoyance rating prediction (ARP) tasks, while coordinating the relations between cAE and fAE and further aligning the two grains of AE information with the AR.
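As a rough sketch of the hierarchy, assuming PyTorch: fine-grained event (fAE) embeddings are pooled into coarse groups (cAE), and a scalar annoyance head sits on top; the fixed group assignment and mean pooling are assumptions for illustration, not the HGRL model.

```python
# fAE embeddings -> cAE group embeddings -> annoyance rating, with a
# presence head at each grain. The fine-to-coarse assignment is a fixed,
# assumed mapping; all sizes are placeholders.
import torch
import torch.nn as nn

class HierarchicalHeads(nn.Module):
    def __init__(self, dim=64, n_fine=25, n_coarse=7):
        super().__init__()
        # assumed fixed assignment of fine-grained events to coarse groups
        self.assign = torch.arange(n_fine) % n_coarse
        self.n_coarse = n_coarse
        self.fae_head = nn.Linear(dim, 1)   # per fine-grained event presence
        self.cae_head = nn.Linear(dim, 1)   # per coarse-grained event presence
        self.ar_head = nn.Linear(dim, 1)    # scalar annoyance rating

    def forward(self, fae_emb):             # fae_emb: (n_fine, dim)
        cae_emb = torch.stack(              # mean-pool each coarse group
            [fae_emb[self.assign == c].mean(dim=0) for c in range(self.n_coarse)])
        fae_logits = self.fae_head(fae_emb).squeeze(-1)
        cae_logits = self.cae_head(cae_emb).squeeze(-1)
        annoyance = self.ar_head(cae_emb.mean(dim=0))
        return fae_logits, cae_logits, annoyance

fae_emb = torch.randn(25, 64)
fae_logits, cae_logits, annoyance = HierarchicalHeads()(fae_emb)
```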